libhashckpt: Hash-Based Incremental Checkpointing Using GPU's
Abstract
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
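The hashing half of the idea can be illustrated with a minimal CPU-only sketch. It is not libhashckpt's implementation: it assumes a 4096-byte block size and SHA-256 digests, omits both the page-protection step and the GPU offload, and the names `block_hashes` and `incremental_checkpoint` are hypothetical. It only shows how comparing per-block hashes against the previous checkpoint identifies which blocks need to be written.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size (a common memory page size)


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[off:off + block_size]).digest()
        for off in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data, previous_hashes, block_size=BLOCK_SIZE):
    """Return (changed_blocks, new_hashes) for one checkpoint step.

    changed_blocks maps a block index to a copy of that block's bytes for
    every block whose digest differs from the previous checkpoint, or that
    did not exist in the previous checkpoint.
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for i, digest in enumerate(new_hashes):
        if i >= len(previous_hashes) or previous_hashes[i] != digest:
            changed[i] = bytes(data[i * block_size:(i + 1) * block_size])
    return changed, new_hashes


# Hypothetical application state: four zero-filled blocks.
state = bytearray(4 * BLOCK_SIZE)

# First checkpoint: there are no previous hashes, so every block is saved.
full, hashes = incremental_checkpoint(state, [])

# Modify a single byte inside block 1, then checkpoint again.
state[BLOCK_SIZE + 10] = 0xFF
delta, hashes = incremental_checkpoint(state, hashes)
```

After the second call, `delta` holds only the one block that was modified, which is the saving incremental checkpointing aims for. Hashing every block has its own cost, which is presumably why the abstract pairs hashing with page protection and moves the hash computation to GPUs.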
Similar resources
Accelerating incremental checkpointing for extreme-scale computing
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to ...
Taking Point Decision Mechanism for Page-level Incremental Checkpointing based on Cost Analysis of Process Execution Time
Incremental checkpointing, which is intended to minimize checkpointing overhead, saves only the modified pages of a process. This means that in incremental checkpointing, the time consumed for checkpointing varies according to the number of modified pages. Thus, efficient checkpointing intervals have to be determined at run time. In this paper, we present an efficient and adapti...
Incremental Checkpointing based on Java Source Code Refactoring
In this project, incremental checkpointing is developed specifically for Java programs. This checkpointing scheme has a flavor of source code refactoring, which performs almost all of the (rule-based) transformation automatically, requiring little (or, in many cases, no) interaction with the programmer. Incremental checkpointing is based on a logging technique that records the change in states instead of ...
An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment
Mobile computing systems are made up of different components, among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...
Checkpointing in Oracle
Checkpointing is an important mechanism for limiting crash recovery times. This paper describes a new checkpointing algorithm that was implemented in Oracle 8.0. This algorithm efficiently finds buffers which need to be written for checkpointing and easily scales to very large buffer cache sizes: it has been tested with buffer caches as large as six million buffers. Based on this algorithm, we ha...